AITopics | interpretability score

Collaborating Authors

interpretability score

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

56ed2bd15b66f709cd81cb1aaa0496b9-Paper-Conference.pdf

Neural Information Processing SystemsFeb-13-2026, 18:57:04 GMT

interpretability, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
North America > United States > Nevada (0.04)
(2 more...)

Genre: Research Report > Experimental Study (1.00)

Industry: Government (0.45)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

56ed2bd15b66f709cd81cb1aaa0496b9-Paper-Conference.pdf

Neural Information Processing SystemsOct-11-2025, 00:22:44 GMT

explanation, interpretability, mis, (17 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
North America > United States > Nevada (0.04)
(2 more...)

Genre: Research Report > Experimental Study (1.00)

Industry: Government (0.45)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

Binary Sparse Coding for Interpretability

Quirke, Lucia, Shabalin, Stepan, Belrose, Nora

arXiv.org Artificial IntelligenceOct-1-2025

Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are only interpretable at high activation strengths. To address this issue we propose to use binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain all activations to be zero or one. We find that binarisation significantly improves the interpretability and monosemanticity of the discovered features, while increasing reconstruction error. By eliminating the distinction between high and low activation strengths, we prevent uninterpretable information from being smuggled in through the continuous variation in feature activations. However, we also find that binarisation increases the number of uninterpretable ultra-high frequency features, and when interpretability scores are frequency-adjusted, the scores for continuous sparse coders are slightly better than those of binary ones. This suggests that polysemanticity may be an ineliminable property of neural activations.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2509.25596

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Gulko, Alex, Peng, Yusen, Kumar, Sachin

arXiv.org Artificial IntelligenceSep-30-2025

Sparse autoencoders (SAEs) are a promising approach for uncovering interpretable features in large language models (LLMs). While several automated evaluation methods exist for SAEs, most rely on external LLMs. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive evaluation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks without requiring an external LLM judge, achieving over 70% Spearman correlation with results in SAEBench. The official implementation and evaluation dataset are open-sourced and publicly available.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.00691

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Cho, Hakaze, Yang, Haolin, Kurkoski, Brian M., Inoue, Naoya

arXiv.org Artificial IntelligenceSep-26-2025

Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs) for interpreting their mechanism. However, they typically rely on autoencoders constrained by some implicit training-time regularization on single training instances (i.e., $L_1$ normalization, top-k function, etc.), without an explicit guarantee of global sparsity among instances, causing a large amount of dense (simultaneously inactive) features, harming the feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation, so that we term it as Binary Autoencoder (BAE) and empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we empirically evaluate and leverage to characterize the inference dynamics of LLMs and In-context Learning. (2) Feature untangling. Similar to typical methods, BAE can extract atomized features from LLM's hidden states. To robustly evaluate such feature extraction capability, we refine traditional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while producing the largest number of interpretable ones among baselines, which confirms the effectiveness of BAE serving as a feature extractor.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.20997

Country:

Asia (1.00)
North America > United States (0.67)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Demystifying the Accuracy-Interpretability Trade-Off: A Case Study of Inferring Ratings from Reviews

Atrey, Pranjal, Brundage, Michael P., Wu, Min, Dutta, Sanghamitra

arXiv.org Artificial IntelligenceMar-10-2025

Interpretable machine learning models offer understandable reasoning behind their decision-making process, though they may not always match the performance of their black-box counterparts. This trade-off between interpretability and model performance has sparked discussions around the deployment of AI, particularly in critical applications where knowing the rationale of decision-making is essential for trust and accountability. In this study, we conduct a comparative analysis of several black-box and interpretable models, focusing on a specific NLP use case that has received limited attention: inferring ratings from reviews. Through this use case, we explore the intricate relationship between the performance and interpretability of different models. We introduce a quantitative score called Composite Interpretability (CI) to help visualize the trade-off between interpretability and performance, particularly in the case of composite models. Our results indicate that, in general, the learning performance improves as interpretability decreases, but this relationship is not strictly monotonic, and there are instances where interpretable models are more advantageous.

bert sentiment score, interpretability, sentiment score, (17 more...)

arXiv.org Artificial Intelligence

2503.07914

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

Add feedback

Transcoders Beat Sparse Autoencoders for Interpretability

Paulo, Gonçalo, Shabalin, Stepan, Belrose, Nora

arXiv.org Artificial IntelligenceFeb-12-2025

Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.

artificial intelligence, machine learning, transcoder, (15 more...)

arXiv.org Artificial Intelligence

2501.18823

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Investigating the Effect of Network Pruning on Performance and Interpretability

von Rad, Jonathan, Seuffert, Florian

arXiv.org Artificial IntelligenceSep-29-2024

Deep Neural Networks (DNNs) are often over-parameterized for their tasks and can be compressed quite drastically by removing weights, a process called pruning. We investigate the impact of different pruning techniques on the classification performance and interpretability of GoogLeNet. We systematically apply unstructured and structured pruning, as well as connection sparsity (pruning of input weights) methods to the network and analyze the outcomes regarding the network's performance on the validation set of ImageNet. We also compare different retraining strategies, such as iterative pruning and one-shot pruning. We find that with sufficient retraining epochs, the performance of the networks can approximate the performance of the default GoogLeNet - and even surpass it in some cases. To assess interpretability, we employ the Mechanistic Interpretability Score (MIS) developed by Zimmermann et al. . Our experiments reveal that there is no significant relationship between interpretability and pruning rate when using MIS as a measure. Additionally, we observe that networks with extremely low accuracy can still achieve high MIS scores, suggesting that the MIS may not always align with intuitive notions of interpretability, such as understanding the basis of correct decisions.

artificial intelligence, machine learning, pruning, (15 more...)

arXiv.org Artificial Intelligence

2409.19727

Country:

North America > Puerto Rico > San Juan > San Juan (0.05)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.05)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Dilated Convolution with Learnable Spacings makes visual models more aligned with humans: a Grad-CAM study

Chamas, Rabih, Khalfaoui-Hassani, Ismail, Masquelier, Timothee

arXiv.org Artificial IntelligenceAug-6-2024

Dilated Convolution with Learnable Spacing (DCLS) is a recent advanced convolution method that allows enlarging the receptive fields (RF) without increasing the number of parameters, like the dilated convolution, yet without imposing a regular grid. DCLS has been shown to outperform the standard and dilated convolutions on several computer vision benchmarks. Here, we show that, in addition, DCLS increases the models' interpretability, defined as the alignment with human visual strategies. To quantify it, we use the Spearman correlation between the models' GradCAM heatmaps and the ClickMe dataset heatmaps, which reflect human visual attention. We took eight reference models - ResNet50, ConvNeXt (T, S and B), CAFormer, ConvFormer, and FastViT (sa 24 and 36) - and drop-in replaced the standard convolution layers with DCLS ones. This improved the interpretability score in seven of them. Moreover, we observed that Grad-CAM generated random heatmaps for two models in our study: CAFormer and ConvFormer models, leading to low interpretability scores. We addressed this issue by introducing Threshold-Grad-CAM, a modification built on top of Grad-CAM that enhanced interpretability across nearly all models. The code and checkpoints to reproduce this study are available at: https://github.com/rabihchamas/DCLS-GradCAM-Eval.

dcl, heatmap, interpretability score, (13 more...)

arXiv.org Artificial Intelligence

2408.03164

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)

Genre: Research Report > New Finding (0.89)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Exploring Variational Auto-Encoder Architectures, Configurations, and Datasets for Generative Music Explainable AI

Bryan-Kinns, Nick, Zhang, Bingyuan, Zhao, Songyan, Banar, Berker

arXiv.org Artificial IntelligenceNov-14-2023

Generative AI models for music and the arts in general are increasingly complex and hard to understand. The field of eXplainable AI (XAI) seeks to make complex and opaque AI models such as neural networks more understandable to people. One approach to making generative AI models more understandable is to impose a small number of semantically meaningful attributes on generative AI models. This paper contributes a systematic examination of the impact that different combinations of Variational Auto-Encoder models (MeasureVAE and AdversarialVAE), configurations of latent space in the AI model (from 4 to 256 latent dimensions), and training datasets (Irish folk, Turkish folk, Classical, and pop) have on music generation performance when 2 or 4 meaningful musical attributes are imposed on the generative model. To date there have been no systematic comparisons of such models at this level of combinatorial detail. Our findings show that MeasureVAE has better reconstruction performance than AdversarialVAE which has better musical attribute independence. Results demonstrate that MeasureVAE was able to generate music across music genres with interpretable musical dimensions of control, and performs best with low complexity music such a pop and rock. We recommend that a 32 or 64 latent dimensional space is optimal for 4 regularised dimensions when using MeasureVAE to generate music across genres. Our results are the first detailed comparisons of configurations of state-of-the-art generative AI models for music and can be used to help select and configure AI models, musical features, and datasets for more understandable generation of music.

dataset, dimension, latent space, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/s11633-023-1457-1

2311.08336

Country:

North America > United States > New York > New York County > New York City (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.85)

Add feedback